Red Wine Data Exploration by Simone Romero

This project is part of the “Explore and Summarize Data” module from Udacity’s Data Scientist Nanodegree Program.

The data set chosen for this project is Red Wine Quality, which is publicly available for research. More details are given in:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

The exploratory analysis will be guided by the following question: Which chemical properties influence the quality of red wines?

Dataset overview

The dataset variables are:

  1. fixed acidity (tartaric acid - g / dm^3)
  2. volatile acidity (acetic acid - g / dm^3)
  3. citric acid (g / dm^3)
  4. residual sugar (g / dm^3)
  5. chlorides (sodium chloride - g / dm^3)
  6. free sulfur dioxide (mg / dm^3)
  7. total sulfur dioxide (mg / dm^3)
  8. density (g / cm^3)
  9. pH
  10. sulphates (potassium sulphate - g / dm^3)
  11. alcohol (% by volume)

Output variable (based on sensory data):

  12. quality (score between 0 and 10)

The variable types are:

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

We can observe that there are discrete and continuous variables. The X variable is just an index for each observation in the dataset, so let’s remove it.

red_wine <- within(red_wine, rm(X))

Let’s see the distribution of our variables in the dataset.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

The variables fixed.acidity, volatile.acidity, citric.acid, residual.sugar, free.sulfur.dioxide, and total.sulfur.dioxide have maximum values far above their third quartiles, which may indicate the existence of outliers.
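One way to make the outlier suspicion concrete is to count the values that fall outside the usual 1.5 * IQR fences. The sketch below is only an illustration: the small data frame is made up to stand in for red_wine, and count_outliers is a hypothetical helper name, not something from the original analysis.

```r
# Toy data standing in for the real data set (values are made up).
red_wine <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.8, 1.6, 2.0, 6.1, 2.2, 15.5),
  pH             = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.30, 3.39, 3.36, 3.35, 3.31)
)

# Hypothetical helper: count values outside the fences
# Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

# Apply the helper to every column of the data frame.
outlier_counts <- sapply(red_wine, count_outliers)
print(outlier_counts)
```

On the full dataset the same sapply() call would give one count per variable, which makes it easy to see which features deserve a closer look.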

Wine quality ratings range from 3 to 8, with a median of 6.

Univariate Plots Section

For the univariate analysis, let’s plot some histograms to understand the structure of the individual variables in the dataset.

The density and pH plots show an approximately normal distribution, while citric.acid, free.sulfur.dioxide, and total.sulfur.dioxide show a right-skewed distribution. Outliers can be observed mainly in the residual.sugar and chlorides plots.
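The plotting code itself is not shown in this report, so the following is only a sketch of how one such histogram could be produced, here with base R’s hist() as a stand-in. The data are the first ten citric.acid values from the str() listing above, not the full column, and the bin width of 0.1 is an arbitrary choice.

```r
# First ten citric.acid values from the str() listing (not the full column).
red_wine <- data.frame(
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
)

# hist() draws the plot and also returns the bin counts.
h <- hist(red_wine$citric.acid,
          breaks = seq(0, 0.6, by = 0.1),
          main = "citric.acid",
          xlab = "citric acid (g / dm^3)")
print(h$counts)

# A mean well above the median is a quick numeric hint of right skew.
ca_mean   <- mean(red_wine$citric.acid)
ca_median <- median(red_wine$citric.acid)
print(c(mean = ca_mean, median = ca_median))
```

Comparing the mean with the median is a rough check only; the shape of the histogram remains the main evidence for skewness.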

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Given the summary and the plot of the quality feature, we can observe that most of the observations are classified as 5 or 6, close to the median of 6. Few examples were classified as 3 or 4, or as 7 or 8, which represent the wines of low and high quality, respectively. Based on that, the data was grouped into 3 categories: low (quality < 5), average (quality 5 or 6), and high (quality >= 7), as shown in the plot below.

##     low average    high 
##      63    1319     217
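A grouping like this can be built with base R’s cut(). The sketch below assumes the intended bins are scores up to 4 (low), 5 to 6 (average), and 7 or more (high); the quality values are made up for illustration, and the original code for this step is not shown in the report.

```r
# Toy quality scores standing in for red_wine$quality (values are made up).
red_wine <- data.frame(quality = c(3, 4, 5, 5, 6, 6, 6, 7, 8))

# Bin the scores into right-closed intervals:
# (-Inf, 4] -> low, (4, 6] -> average, (6, Inf] -> high.
red_wine$rating <- cut(
  red_wine$quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("low", "average", "high")
)

# Count the observations in each category.
rating_counts <- table(red_wine$rating)
print(rating_counts)
```

Because cut() returns a factor whose levels follow the order of the labels, tables and boxplots built on rating list the groups as low, average, high rather than alphabetically.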

Univariate Analysis

What is the structure of your dataset?

The dataset contains 1,599 observations of red wines. After removing the index, it is composed of 12 features: 11 chemical properties and the score given by the experts, named quality.

What is/are the main feature(s) of interest in your dataset?

The main feature in the dataset is quality since it represents the experts’ opinion about the wines.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think volatile.acidity, citric.acid, total.sulfur.dioxide, pH, and the percent alcohol of the wine are the features that can support the investigation, since they are the ones that contribute most to the smell and taste of wine.

Did you create any new variables from existing variables in the dataset?

Yes, I created the ‘rating’ variable, which is a categorical representation of wine quality: low (quality < 5), average (quality 5 or 6), and high (quality >= 7).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I have removed the X variable, which represented the dataset index.

Bivariate Plots Section

In this section, we are going to explore the following features: volatile.acidity, citric.acid, total.sulfur.dioxide, pH, and alcohol.

## red_wine$rating: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.5650  0.6800  0.7242  0.8825  1.5800 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.4100  0.5400  0.5386  0.6400  1.3300 
## -------------------------------------------------------- 
## red_wine$rating: high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4055  0.4900  0.9150

## red_wine$rating: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0200  0.0800  0.1737  0.2700  1.0000 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2400  0.2583  0.4000  0.7900 
## -------------------------------------------------------- 
## red_wine$rating: high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3000  0.4000  0.3765  0.4900  0.7600

## red_wine$rating: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   13.50   26.00   34.44   48.00  119.00 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   24.00   40.00   48.95   65.00  165.00 
## -------------------------------------------------------- 
## red_wine$rating: high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   17.00   27.00   34.89   43.00  289.00

## red_wine$rating: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.300   3.380   3.384   3.500   3.900 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.860   3.210   3.310   3.311   3.400   4.010 
## -------------------------------------------------------- 
## red_wine$rating: high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.270   3.289   3.380   3.780

## red_wine$rating: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.60   10.00   10.22   11.00   13.10 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.00   10.25   10.90   14.90 
## -------------------------------------------------------- 
## red_wine$rating: high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.60   11.52   12.20   14.00
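The grouped summaries above have the layout that base R’s by() prints when summary() is applied per group. The sketch below shows that pattern on a small made-up data frame; it is an assumption about how the output was produced, not the report’s original code.

```r
# Toy data standing in for the real data set (values are made up).
red_wine <- data.frame(
  alcohol = c(9.4, 9.8, 10.0, 10.5, 9.5, 10.2, 11.6, 12.2, 11.0),
  rating  = factor(
    c("low", "low", "low", "average", "average", "average",
      "high", "high", "high"),
    levels = c("low", "average", "high")
  )
)

# One summary() per rating level, printed group by group.
print(by(red_wine$alcohol, red_wine$rating, summary))

# The same medians as a named vector, which is easier to compare.
group_medians <- tapply(red_wine$alcohol, red_wine$rating, median)
print(group_medians)
```

The tapply() form is handy when only one statistic is needed, for example to line up the medians of the three groups side by side.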

The boxplots suggest a relationship between pH and citric.acid. As the pH decreases, the citric.acid value tends to increase and the wine becomes more acidic, and the group with the ‘high’ rating has the lowest median pH (3.27).

The plot below shows a negative correlation of -0.5419 between pH and citric.acid features.

## [1] -0.5419041

Alcohol and citric.acid both play important roles in the high quality wines; however, there is no particularly striking relationship between the two features (a weak positive correlation of 0.1099), as presented below.

## [1] 0.1099032

Still trying to relate the acidity to the alcohol level of the wine, the alcohol and the pH features presented a positive correlation of 0.2056, as shown in the following plot.

## [1] 0.2056325

Using a different feature, the plot below shows the relationship between alcohol and density. They presented a negative correlation of -0.4961. In other words, the higher the alcohol level, the lower the density of wine.

## [1] -0.4961798
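Single values printed as “[1] ...” like the ones above are what cor() returns for a pair of numeric vectors. The sketch below uses made-up alcohol and density values purely for illustration, and adds cor.test(), which the report does not use, to show how a confidence interval could be obtained for the same coefficient.

```r
# Toy measurements standing in for the real columns (values are made up).
red_wine <- data.frame(
  alcohol = c(9.4, 9.8, 10.0, 10.5, 11.0, 11.6, 12.2, 12.8),
  density = c(0.9978, 0.9970, 0.9972, 0.9965, 0.9958, 0.9960, 0.9949, 0.9940)
)

# Pearson correlation coefficient (the default method of cor()).
r <- cor(red_wine$alcohol, red_wine$density)
print(r)

# cor.test() reports the same estimate with a confidence interval and p-value.
test_result <- cor.test(red_wine$alcohol, red_wine$density)
print(test_result$conf.int)
```

A coefficient alone says nothing about its uncertainty, so the interval from cor.test() is a useful companion when judging weak correlations such as 0.1099.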

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The feature volatile.acidity represents the amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegar taste. In the boxplots it is possible to observe a strong relationship between this feature and the quality rating: wines with elevated volatile.acidity obtained a low quality rating, whereas wines with lower volatile.acidity obtained a high quality rating. For the wines rated high, the 3rd quartile (0.4900) is lower than the median (0.5400) of the wines rated average. In other words, volatile.acidity is lower for the high quality wines.

The median values presented in the boxplots of the citric.acid feature were: 0.0800 for low, 0.2400 for average, and 0.4000 for high quality rating. This means that high quality wines present a higher concentration of citric.acid, which is the opposite of the pattern seen in the volatile.acidity plots.

For the total.sulfur.dioxide feature, there was no clear relationship with the quality feature, since the boxplots present very close medians for low and high quality wines (26.00 and 27.00, respectively), while the median for average wines is higher (40.00).

The pH feature describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale. According to the boxplots, the median pH is 3.380 for low quality wines, 3.310 for average quality wines, and 3.270 for high quality wines. In other words, more acidic wines tend to be rated higher.

Regarding the percent alcohol content of the wine, the boxplots show that high quality wines have a clearly higher alcohol level (median of 11.60) than low and average quality wines (median of 10.00 for both).

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Volatile.acidity and citric.acid presented a high negative correlation (-0.5524).

Although wines with higher alcohol content and higher acidity have received the high quality classification, the relationship between these features is weak: a positive correlation of 0.2056 between alcohol and pH, and of 0.1099 between alcohol and citric.acid.

Another interesting relationship was observed between alcohol and density. They presented a negative correlation of -0.4961. In other words, the higher the alcohol level, the lower the density of wine.

What was the strongest relationship you found?

The strongest relationship was found between volatile.acidity and citric.acid, which presented a negative correlation of -0.5524.

Multivariate Plots Section

Based on the results of the previous section, when comparing citric.acid and volatile.acidity, we observed that most of the high quality wines presented high citric.acid concentration and low volatile.acidity concentration. The reverse is true for wines that have obtained low quality rating.

The pH and alcohol features were also analyzed previously. In the plot below it is possible to see how a higher pH is associated with the low classification rating of red wines.

It seems that alcohol is an important characteristic for classification, so we compare this variable with others that may directly impact the high or low quality rating of a wine.

In the following plot, we observed that low quality wines have higher density and low alcohol level.

For alcohol and volatile.acidity features it is clear that low volatile.acidity and high alcohol level are very important to the wine classification as high quality.

Another important feature is citric.acid; however, when comparing it to alcohol, nothing too striking stands out about how the combination of these features relates to low or high quality wines.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

For the multivariate analysis six features were considered: alcohol, pH, volatile.acidity, citric.acid, density, and rating (categorical for quality).

When grouped together, the combinations of these chemical properties associated with high quality wines become evident:

  • High citric.acid concentration and low volatile.acidity concentration.
  • High alcohol level and low pH scale.
  • High alcohol level and low density.

Considering the important role of alcohol level, we also compared it with other features. When compared to volatile.acidity it was clear that low volatile.acidity and high alcohol level are very important to the wine classification as high quality. However, when alcohol was plotted with citric.acid, no clear relationship was observed.
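The combinations listed above can also be checked numerically by computing the median of several features within each rating group. The sketch below does this with aggregate() on a small made-up data frame; the values are illustrative only and the report’s own multivariate plots were not produced this way.

```r
# Toy data standing in for the real data set (values are made up).
red_wine <- data.frame(
  alcohol          = c(9.4, 9.6, 10.0, 10.2, 11.6, 12.0),
  volatile.acidity = c(0.88, 0.70, 0.54, 0.50, 0.37, 0.30),
  rating = factor(
    c("low", "low", "average", "average", "high", "high"),
    levels = c("low", "average", "high")
  )
)

# Median of each feature within each rating group, one row per group.
medians <- aggregate(cbind(alcohol, volatile.acidity) ~ rating,
                     data = red_wine, FUN = median)
print(medians)
```

A table like this complements the scatter plots: if alcohol rises and volatile.acidity falls from the low row to the high row, the two features point in the same direction for quality.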

Were there any interesting or surprising interactions between features?

I was surprised that there was no clear relationship between alcohol and citric.acid.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

## red_wine$rating: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.60   10.00   10.22   11.00   13.10 
## -------------------------------------------------------- 
## red_wine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.00   10.25   10.90   14.90 
## -------------------------------------------------------- 
## red_wine$rating: high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.60   11.52   12.20   14.00

Description One

This plot is interesting because the boxplots show that the higher the percentage of alcohol the higher the quality of wine. The median alcohol level for high quality wine is 11.60 and the mean is 11.52. For the low quality wines, the 3rd Quartile was 11.00.

Plot Two

Description Two

In this plot it is possible to observe the importance of volatile.acidity and citric.acid to obtain high quality wines. Most of the high quality wines (yellow points) presented high citric.acid concentration and low volatile.acidity concentration, whereas the low quality wines (violet points) presented low citric.acid concentration and high volatile.acidity value.

Plot Three

Description Three

Results similar to those presented in the previous graphs can be observed here when we compare the level of alcohol with the volatile.acidity. High volatile.acidity is associated with low quality wines, while high alcohol content is associated with high quality wines.


Reflection

The dataset analyzed contains 1,599 observations of red wines. It is composed of 12 features: 11 chemical properties and the score given by the experts, named quality.

The quality score ranges from 0 to 10. Given the summary of this feature, we observed that most of the instances are classified as 5 or 6 and only a few were classified as 3 or 4, or as 7 or 8. Based on that, the data was grouped into 3 categories: low (quality score less than 5), average (quality score of 5 or 6), and high (quality score of 7 or higher).

Based on an initial analysis, volatile.acidity, citric.acid, total.sulfur.dioxide, pH, and the percent alcohol of the wine were the features considered to support the investigation, since they are the ones that contribute most to the smell and taste of wine.

Based on the plots produced, it was possible to observe that not all the features played a clear role in the classification of the wines. Volatile.acidity, citric.acid, and alcohol level are the ones that stood out the most.

Considering the process itself, it was very important to note that even though the dataset does not contain many features, not all of them are representative for the classification task. In addition, this whole process of exploring the data through graphics is laborious but can save us a lot of time during modeling.

References

https://github.com/agapic/Data-Analyst-Nanodegree-Udacity/tree/master/Project%204%20-%20Explore%20and%20Summarize%20Data%20with%20R

https://github.com/baocongchen/Explore-and-Summarize-Data/blob/master/projectTemplate.Rmd

https://github.com/BlaneG/explore-and-summarize-data/blob/master/Red_Wine_Analysis.Rmd